Bank Term Deposit Classification Project (Ensemble Technique)

Data Description: The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

Domain: Banking

Context: Leveraging customer information is paramount for most businesses. In the case of a bank, attributes of customers like the ones mentioned below can be crucial in strategizing a marketing campaign when launching a new product.

Attribute Information

Learning Outcomes

Table of Contents

Import Packages

Reading the data as a dataframe and printing the first fifteen rows
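A minimal sketch of this step; the UCI bank marketing file (commonly named bank-full.csv) is semicolon-delimited, so an in-memory two-row sample stands in for the real file here:

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row sample mimicking the layout of the UCI bank dataset,
# which is semicolon-delimited; the real call would be something like
# pd.read_csv("bank-full.csv", sep=";") on the full 45211-row file.
sample = StringIO(
    "age;job;marital;balance;y\n"
    "58;management;married;2143;no\n"
    "33;technician;single;29;yes\n"
)
df = pd.read_csv(sample, sep=";")
print(df.head(15))  # the notebook prints the first fifteen rows
```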

Getting info on the dataframe columns

Observation 1 - Dataset shape

Dataset has 45211 rows and 17 columns, with no missing values.

Exploratory Data Analysis

Performing exploratory data analysis on the bank dataset. Below are some of the steps performed:

Five-point summary of numerical attributes and checking unique values in 'object' columns
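Both checks can be sketched on a toy frame (column names match the dataset; the values are illustrative):

```python
import pandas as pd

# Toy frame standing in for the bank dataframe (values are made up)
df = pd.DataFrame({
    "age": [58, 33, 41, 29],
    "balance": [2143, 29, 1270, 390],
    "job": ["management", "technician", "management", "student"],
})

# Five-point summary (min, Q1, median, Q3, max plus count/mean/std)
summary = df.describe()
print(summary)

# Unique values in every object-typed column
for col in df.select_dtypes(include="object"):
    print(col, df[col].unique())
```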

Observation 2 - information on the type of variable and min-max values

Client info
Last contact info
This campaign info
Previous campaign info
Target

Observation 3 - Descriptive statistics for the numerical variables

Descriptive statistics for the numerical variables (age, balance, duration, campaign, pdays, previous)

Checking the distribution of target variable

Observation 4 - Distribution of target variable

Out of 45211 cases, only 5289 (=11.69%) are the cases where the client has subscribed to the term deposit.
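The split can be verified with value_counts; the counts below are taken from the observation above:

```python
import pandas as pd

# Class counts and proportions for the target; the real column is 'y'
y = pd.Series(["no"] * 39922 + ["yes"] * 5289, name="y")
counts = y.value_counts()
pct = y.value_counts(normalize=True).round(4) * 100
print(counts)
print(pct)  # 'yes' comes out to roughly 11.7%
```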

Univariate and Bivariate Visualization

Looking at one feature at a time to understand how the values are distributed, checking for outliers, and checking each column's relation with the target column (bivariate).

Observation 5 - Comments from categorical columns

Checking outliers and distribution for numerical columns, and plotting their relation with the target variable

Observation 6 - Comments from numerical columns

Using the MICE imputer to handle the outliers that were replaced with np.nan in the earlier step
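scikit-learn exposes a MICE-style imputer as IterativeImputer (it must be enabled explicitly); a minimal sketch on toy data with one value masked as an outlier:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Outliers were replaced with np.nan in the earlier step; IterativeImputer
# is scikit-learn's MICE-style imputer, modelling each column from the others.
X = np.array([
    [25.0, 1200.0],
    [40.0, np.nan],   # value masked as an "outlier"
    [35.0, 900.0],
    [50.0, 2000.0],
])
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```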

Observation 7 - Observation after MICE

Column   | Before MICE                                                      | After MICE
age      | Q1-Q3 range is 33-48; mean > median, right (positively) skewed   | Q1-Q3 range unchanged; slight reduction in mean due to changed min/max values; still right skewed
balance  | Q1-Q3 range is 72-1428; mean > median, right (positively) skewed | Q1-Q3 range is 81-1402; reduction in mean; still right skewed
duration | Q1-Q3 range is 103-319; mean > median, right (positively) skewed | Q1-Q3 range is 106-316; still right skewed
campaign | Q1-Q3 range is 1-3; mean > median, right (positively) skewed     | Range and skewness unchanged
pdays    | 75% of data values are around -1                                 | Unchanged
previous | 75% of data values are around 0                                  | Unchanged

Checking whether the count of 0 in the previous column equals the count of -1 in the pdays column

Observation 8 - pdays and previous

The count of 0 in previous equals the count of -1 in pdays, so we might replace -1 in pdays with 0 to account for cases where the client wasn't contacted previously. Checking correlation between variables and the target next.
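The check and the suggested recode can be sketched on a toy stand-in for the two columns:

```python
import pandas as pd

# Toy stand-in: -1 in pdays marks "never contacted", matching previous == 0
df = pd.DataFrame({
    "pdays": [-1, 10, -1, 5, -1],
    "previous": [0, 2, 0, 1, 0],
})

# The two counts should match if -1 and 0 encode the same "no prior contact"
assert (df["pdays"] == -1).sum() == (df["previous"] == 0).sum()

# Optional recode suggested in the observation above
df["pdays"] = df["pdays"].replace(-1, 0)
print(df)
```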

Multivariate visualization

Checking relationships between two or more variables. Includes a correlation and scatterplot matrix, and checking the relation between pairs of variables and the target.

Checking the scatterplot matrix

Correlation matrix
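A correlation matrix can be computed with pandas alone (the column names and values below are illustrative):

```python
import pandas as pd

# Pairwise Pearson correlations between numeric attributes
df = pd.DataFrame({
    "age": [25, 40, 35, 50],
    "balance": [100, 900, 600, 2000],
    "duration": [120, 300, 180, 90],
})
corr = df.corr(numeric_only=True)
print(corr.round(2))
# A heatmap makes this easier to read, e.g. seaborn.heatmap(corr, annot=True)
```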

Observation 9 - Correlation Matrix

Creating age groups and checking their relation with balance and target, and with campaign and target
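Age groups are typically created with pd.cut; the bin edges and labels below are assumptions, and the notebook's exact groups may differ:

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 34, 47, 61, 78]})

# Bin edges are an assumption; (17, 30] maps to "18-30", and so on
df["age_group"] = pd.cut(
    df["age"],
    bins=[17, 30, 45, 60, 100],
    labels=["18-30", "31-45", "46-60", "60+"],
)
print(df)
```

Grouping on `age_group` then makes it easy to compare mean balance or campaign contacts per group against the target.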

Observation 10 - Comments

Modelling

Dummy Classifier -- Baseline Model

Logistic Regression, kNN and Naive Bayes

Oversampling the one with better accuracy and recall score for subscribe

Logistic Regression

k-Nearest Neighbor Classifier

Naive Bayes Classifier

Oversampling and Naive Bayes

Oversampling and Logistic Regression

Ensemble Techniques

Decision Tree Classifier, Bagging Classifier, AdaBoost Classifier, Gradient Boosting Classifier and Random Forest Classifier. Oversampling the ones with higher accuracy and better recall for subscribe.
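One way these five classifiers might be compared on cross-validated recall, with synthetic data standing in for the bank features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (~12% positives) standing in for the bank dataset
X, y = make_classification(n_samples=300, weights=[0.88, 0.12], random_state=0)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
    "rf": RandomForestClassifier(random_state=0),
}

# Score each model on recall for the positive (subscribed) class
scores = {
    name: cross_val_score(m, X, y, cv=3, scoring="recall").mean()
    for name, m in models.items()
}
print(scores)
```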

Decision Tree Classifier

Bagging, AdaBoost, Gradient Boosting Classifier

Oversampling and AdaBoost Classifier

Random Forest Classifier

Oversampling and Random Forest Classifier

Comparing model results

Conclusion

The classification goal is to predict whether the client will subscribe to a term deposit (yes/no).

Most ML models work best when the classes are in equal proportion, since they are designed to maximize accuracy and reduce error; they do not take the class distribution or balance into account. In our dataset, 11.7% of clients subscribed to the term deposit (class 'yes', i.e. 1), whereas about 88.3% did not (class 'no', i.e. 0).

Building a DummyClassifier as a baseline model gave an accuracy of 88.2% with zero recall and precision for predicting the minority class, i.e. clients who subscribed to term deposits. In such cases, performance measures such as precision, recall, and F1-score are more informative, and we can calculate these metrics specifically for the minority (positive) class.
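This baseline behaviour can be reproduced on a small synthetic stand-in (the class proportions here are illustrative, not the real counts):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Majority-class baseline: with ~88% 'no', it scores ~88% accuracy
# while never predicting the minority class at all.
X = np.zeros((100, 1))
y = np.array([0] * 88 + [1] * 12)

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
acc = dummy.score(X, y)
print(acc)                          # high accuracy...
print(dummy.predict(X).sum())       # ...yet class 1 is never predicted
```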

The confusion matrix for class 1 (Subscribed) would look like:

                           | Predicted: 0 (Not Subscribed) | Predicted: 1 (Subscribed)
Actual: 0 (Not Subscribed) | True Negatives                | False Positives
Actual: 1 (Subscribed)     | False Negatives               | True Positives

In our case, recall holds more importance than precision, so we chose recall (particularly for class 1) together with accuracy as evaluation metrics. Also important is how the model behaves on the training and test scores across the cross-validation sets.
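A small illustration of why recall for class 1 matters, on toy labels: accuracy stays high even when half the subscribers are missed.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 8 non-subscribers, 2 subscribers; one subscriber is missed
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, pos_label=1)
print(acc)  # accuracy looks good
print(rec)  # but only half the subscribers were caught
```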

Modeling was subdivided into two phases. In the first phase we applied standard models (with and without hyperparameter tuning where applicable) such as Logistic Regression, k-Nearest Neighbor and Naive Bayes classifiers. In the second phase we applied ensemble techniques such as Decision Tree, Bagging, AdaBoost, Gradient Boosting and Random Forest classifiers, oversampling the ones with higher accuracy and better recall for 'subscribe'.

Oversampling is one of the common ways to tackle imbalanced data: it refers to various methods that increase the number of instances of the underrepresented class in the dataset. Out of these methods, we chose the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE's main advantage over naive random over-sampling is that, by creating synthetic observations instead of reusing existing ones, the classifier is less likely to overfit.

In the first phase (Standard machine learning models vs baseline model),

In the second phase (Ensemble models vs baseline model),